greedy action
Heuristic Search for Multi-Objective Probabilistic Planning
Chen, Dillon, Trevizan, Felipe, Thiébaux, Sylvie
Heuristic search is a powerful approach that has successfully been applied to a broad class of planning problems, including classical planning, multi-objective planning, and probabilistic planning modelled as a stochastic shortest path (SSP) problem. Here, we extend the reach of heuristic search to a more expressive class of problems, namely multi-objective stochastic shortest paths (MOSSPs), which require computing a coverage set of non-dominated policies. We design new heuristic search algorithms MOLAO* and MOLRTDP, which extend well-known SSP algorithms to the multi-objective case. We further construct a spectrum of domain-independent heuristic functions differing in their ability to take into account the stochastic and multi-objective features of the problem to guide the search. Our experiments demonstrate the benefits of these algorithms and the relative merits of the heuristics.
- North America > United States > Oklahoma > Payne County > Cushing (0.04)
- Europe > United Kingdom (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
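The core multi-objective ingredient in the abstract above is maintaining a coverage set of non-dominated value vectors instead of a single scalar value. As a minimal, hypothetical Python sketch (not code from the paper), the Pareto-dominance test and pruning step below illustrate that operation; the names `dominates` and `prune` are ours.

```python
from typing import List, Tuple

Vec = Tuple[float, ...]  # one cost vector per candidate policy

def dominates(u: Vec, v: Vec) -> bool:
    """u dominates v if it is no worse in every objective and strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def prune(vectors: List[Vec]) -> List[Vec]:
    """Keep only non-dominated vectors: a naive coverage set."""
    return [v for i, v in enumerate(vectors)
            if not any(dominates(u, v) for j, u in enumerate(vectors) if j != i)]

if __name__ == "__main__":
    candidates = [(3.0, 1.0), (2.0, 2.0), (4.0, 2.0), (1.0, 5.0)]
    print(prune(candidates))  # (4.0, 2.0) is dominated by (2.0, 2.0) and drops out
```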
Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning
Wan, Lipeng, Liu, Zeyang, Chen, Xingyu, Lan, Xuguang, Zheng, Nanning
Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible point of convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.
- North America > United States > Maryland > Baltimore (0.04)
- Asia > China > Shaanxi Province (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
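The failure mode that motivates GVR, relative overgeneralization under linear value decomposition, can be reproduced in a few lines. The sketch below is a generic illustration rather than the paper's algorithm: it fits Q_tot(a1, a2) ≈ Q1(a1) + Q2(a2) to a classic cooperative matrix game by least squares and shows that the resulting individual greedy actions miss the optimal joint action; the payoff matrix and variable names are assumptions made for the example.

```python
import numpy as np

# Classic "relative overgeneralization" matrix game: the optimal joint action
# (0, 0) pays 8 but is surrounded by heavily penalized joint actions.
payoff = np.array([[  8., -12., -12.],
                   [-12.,   0.,   0.],
                   [-12.,   0.,   0.]])

n = payoff.shape[0]
rows, cols = np.indices(payoff.shape)

# Linear value decomposition: Q_tot(a1, a2) ~= Q1(a1) + Q2(a2), fitted by least squares.
X = np.zeros((n * n, 2 * n))
X[np.arange(n * n), rows.ravel()] = 1.0      # indicator for agent 1's action
X[np.arange(n * n), n + cols.ravel()] = 1.0  # indicator for agent 2's action
theta, *_ = np.linalg.lstsq(X, payoff.ravel(), rcond=None)
q1, q2 = theta[:n], theta[n:]

print("individual greedy joint action:", (int(q1.argmax()), int(q2.argmax())))  # not (0, 0)
print("optimal joint action:          ", np.unravel_index(payoff.argmax(), payoff.shape))
```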
A new soft computing method for integration of expert's knowledge in reinforcement learning problems
Annabestani, Mohsen, Abedi, Ali, Nematollahi, Mohammad Reza, Sistani, Mohammad Bagher Naghibi
This paper proposes a novel fuzzy action selection method to leverage human knowledge in reinforcement learning problems. Based on the estimates of the most current action-state values, the proposed fuzzy nonlinear mapping assigns each member of the action set to its probability of being chosen in the next step. A user-tunable parameter is introduced to control the action selection policy, which determines the agent's greedy behavior throughout the learning process. This parameter resembles the role of the temperature parameter in the softmax action selection policy, but its tuning process can be more knowledge-oriented, since it conveys human knowledge to the learning agent through modifications of the fuzzy rule base. Simulation results indicate that including fuzzy logic within reinforcement learning in the proposed manner improves the learning algorithm's convergence rate and provides superior performance.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
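For reference, the softmax (Boltzmann) action selection policy whose temperature parameter the abstract compares against looks roughly like the sketch below; this is the standard baseline, not the paper's fuzzy mapping.

```python
import numpy as np

def softmax_action(q_values, temperature, rng=None):
    """Boltzmann (softmax) action selection: high temperature -> near-uniform
    exploration, low temperature -> near-greedy behaviour."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                         # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [1.0, 1.5, 0.2]
print(softmax_action(q, temperature=0.1))   # almost always the greedy action (index 1)
print(softmax_action(q, temperature=10.0))  # close to uniformly random
```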
qRRT: Quality-Biased Incremental RRT for Optimal Motion Planning in Non-Holonomic Systems
Pareekutty, Nahas, James, Francis, Ravindran, Balaraman, Shah, Suril V.
This paper presents a sampling-based method for optimal motion planning in non-holonomic systems in the absence of known cost functions. It uses the principle of learning through experience to deduce the cost-to-go of regions within the workspace. This cost information is used to bias an incremental graph-based search algorithm that produces solution trajectories. Iterative improvement of cost information and search biasing produces solutions that are proven to be asymptotically optimal. The proposed framework builds on incremental Rapidly-exploring Random Trees (RRT) for random sampling-based search and Reinforcement Learning (RL) to learn workspace costs. A series of experiments were performed to evaluate and demonstrate the performance of the proposed method.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.05)
- Asia > India > Telangana > Hyderabad (0.04)
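The idea of biasing an incremental search with learned cost-to-go can be hinted at with a small sketch. The function below is an assumed, simplified illustration rather than the qRRT algorithm itself: it picks a tree node to expand with probability weighted by a softmin over estimated cost-to-go values.

```python
import math
import random

def quality_biased_choice(nodes, cost_to_go, beta=1.0, rng=random):
    """Select a node for expansion, favouring low estimated cost-to-go."""
    weights = [math.exp(-beta * cost_to_go[n]) for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

nodes = ["root", "a", "b"]
cost_to_go = {"root": 5.0, "a": 2.0, "b": 8.0}   # e.g. estimates learned via RL
print(quality_biased_choice(nodes, cost_to_go, beta=0.5))  # usually "a"
```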
Introduction to Reinforcement Learning (RL) -- Part 4 -- "Dynamic Programming"
Starting in this chapter, the assumption is that the environment is a finite Markov Decision Process (finite MDP). In this chapter we'll see how we can use DP algorithms to compute the value functions in a slightly different, more tractable way. The general idea is to take these two equations and turn them into update rules for improving the approximations of our value functions. It will make more sense later on. Policy Evaluation: Policy evaluation means computing the state-value function Vπ for an arbitrary policy π.
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.56)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.44)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)
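As a concrete companion to the excerpt, here is a minimal sketch of iterative policy evaluation, i.e. the Bellman expectation equation turned into an in-place update rule; the transition and reward data structures are assumptions made for the example, not taken from the post.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a finite MDP.

    P[s][a]      -> list of (probability, next_state) pairs
    R[s][a]      -> expected immediate reward
    policy[s][a] -> probability of taking action a in state s
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(policy[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                        for a in range(len(P[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Tiny 2-state MDP: action 0 stays put, action 1 moves to the other state.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0], [0.0, 0.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(policy_evaluation(P, R, uniform))
```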
Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning
Hu, Hengyuan, Foerster, Jakob N
In recent years we have seen fast progress on a number of benchmark problems in AI, with modern methods achieving near- or super-human performance in Go, Poker and Dota. One common aspect of all of these challenges is that they are by design adversarial or, technically speaking, zero-sum. In contrast to these settings, success in the real world commonly requires humans to collaborate and communicate with others, in settings that are, at least partially, cooperative. In the last year, the card game Hanabi has been established as a new benchmark environment for AI to fill this gap. In particular, Hanabi is interesting to humans since it is entirely focused on theory of mind, i.e., the ability to effectively reason over the intentions, beliefs and point of view of other agents when observing their actions. Learning to be informative when observed by others is an interesting challenge for Reinforcement Learning (RL): fundamentally, RL requires agents to explore in order to discover good policies. However, when done naively, this randomness will inherently make their actions less informative to others during training. We present a new deep multi-agent RL method, the Simplified Action Decoder (SAD), which resolves this contradiction by exploiting the centralized training phase. During training, SAD allows agents to observe not only the (exploratory) action chosen by their teammates but also their greedy action. By combining this simple intuition with best practices for multi-agent learning, SAD establishes a new SOTA for learning methods for 2-5 players on the self-play part of the Hanabi challenge. Our ablations show the contributions of SAD compared with the best-practice components. All of our code and trained agents are available at https://github.com/facebookresearch/Hanabi_SAD.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
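The trick described in the abstract, letting teammates observe the greedy action alongside the executed exploratory action during centralized training, can be sketched in a few lines. This is an assumed illustration of that idea, not the released implementation linked above.

```python
import random

def select_action_sad(q_values, epsilon, rng=random):
    """Epsilon-greedy selection that also exposes the greedy action, so teammates
    can condition on both during centralized training (the SAD idea)."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    executed = rng.randrange(len(q_values)) if rng.random() < epsilon else greedy
    return executed, greedy

q = [0.1, 0.7, 0.3]
executed, greedy = select_action_sad(q, epsilon=0.3)
print("executed (in the environment):", executed)
print("pair observed by teammates during training:", (executed, greedy))
```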
Actor-Expert: A Framework for using Action-Value Methods in Continuous Action Spaces
Lim, Sungsu, Joseph, Ajin, Le, Lei, Pan, Yangchen, White, Martha
Value-based approaches can be difficult to use in continuous action spaces, because an optimization has to be solved to find the greedy action for the action-values. A common strategy has been to restrict the functional form of the action-values to be convex or quadratic in the actions, to simplify this optimization. Such restrictions, however, can prevent learning accurate action-values. In this work, we propose the Actor-Expert framework for value-based methods, which decouples action selection (Actor) from the action-value representation (Expert). The Expert uses Q-learning to update the action-values towards the optimal action-values, whereas the Actor learns to output the greedy action for the current action-values. We develop a Conditional Cross Entropy Method for the Actor to learn the greedy action for a generically parameterized Expert, and provide a two-timescale analysis to validate asymptotic behavior. We demonstrate in a toy domain with bimodal action-values that previous restrictive action-value methods fail, whereas the decoupled Actor-Expert with a more general action-value parameterization succeeds. Finally, we demonstrate that Actor-Expert performs as well as or better than these other methods on several benchmark continuous-action domains.
- North America > Canada > Alberta (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)
- (3 more...)
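To make the Actor's job concrete, here is a hedged sketch of using a cross-entropy method search to find the greedy action of a continuous-action Q-function; the function and parameter names, and the toy bimodal Q, are assumptions for the example rather than the paper's Conditional Cross Entropy Method.

```python
import numpy as np

def cem_greedy_action(q_func, state, action_dim, iters=10, pop=256, elite_frac=0.1,
                      init_std=2.0, rng=None):
    """Cross-entropy method: sample candidate actions from a Gaussian, keep the
    top-scoring ('elite') ones under Q(state, .), refit the Gaussian, repeat."""
    rng = rng or np.random.default_rng(0)
    mean = np.zeros(action_dim)
    std = np.full(action_dim, init_std)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(pop, action_dim))
        scores = np.array([q_func(state, a) for a in actions])
        elite = actions[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy bimodal action-value with peaks near a = -1 and a = +2 (the higher one).
def q_toy(state, a):
    return np.exp(-4.0 * (a[0] + 1.0) ** 2) + 1.5 * np.exp(-4.0 * (a[0] - 2.0) ** 2)

print(cem_greedy_action(q_toy, state=None, action_dim=1))  # typically converges near 2.0
```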
Goal-oriented Trajectories for Efficient Exploration
Pardo, Fabio, Levdik, Vitaly, Kormushev, Petar
Exploration is a difficult challenge in reinforcement learning, and even recent state-of-the-art curiosity-based methods rely on the simple epsilon-greedy strategy to generate novelty. We argue that pure random walks do not properly expand the exploration area in most environments and propose to replace single random action choices with the selection of random goals followed by several steps in their direction. This approach is compatible with any curiosity-based exploration and off-policy reinforcement learning agents, and generates longer and safer trajectories than individual random actions. To illustrate this, we present a task-independent agent that learns to reach coordinates in screen frames and demonstrate its ability to explore with the game Super Mario Bros., significantly improving the score of a baseline DQN agent.
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
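The contrast drawn in the abstract, single random actions versus a random goal followed by several steps toward it, can be sketched on a toy grid. This is an assumed illustration of the idea, not the paper's agent, which learns to reach coordinates in screen frames.

```python
import random

MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]

def random_walk(steps, rng):
    """Baseline: one independent random action per step."""
    x = y = 0
    visited = {(0, 0)}
    for _ in range(steps):
        dx, dy = rng.choice(MOVES)
        x, y = x + dx, y + dy
        visited.add((x, y))
    return visited

def goal_directed_walk(steps, horizon, rng):
    """Pick a random nearby goal, then walk toward it for up to `horizon` steps."""
    x = y = 0
    visited = {(0, 0)}
    taken = 0
    while taken < steps:
        gx = x + rng.randint(-horizon, horizon)
        gy = y + rng.randint(-horizon, horizon)
        for _ in range(horizon):
            if taken >= steps or (x, y) == (gx, gy):
                break
            if x != gx:
                x += 1 if gx > x else -1
            else:
                y += 1 if gy > y else -1
            visited.add((x, y))
            taken += 1
    return visited

rng = random.Random(0)
# The goal-directed walk typically visits many more distinct cells for the same budget.
print("random-walk cells visited:   ", len(random_walk(1000, rng)))
print("goal-directed cells visited: ", len(goal_directed_walk(1000, 20, rng)))
```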